ABSTRACT

We present a new approach to estimating the interdependence of industries in an economy by applying data science solutions. By exploiting interfirm buyer-seller network data, we show that the problem of estimating the interdependence of industries is similar to the problem of uncovering the latent block structure in network science literature. To estimate the underlying structure with greater accuracy, we propose an extension of the sparse block model that incorporates node textual information and an unbounded number of industries and interactions among them. The latter task is accomplished by extending the well-known Chinese restaurant process to two dimensions. Inference is based on collapsed Gibbs sampling, and the model is evaluated on both synthetic and real-world datasets. We show that the proposed model improves predictive accuracy and provides a satisfactory solution to the motivating problem. We also discuss issues that affect the future performance of this approach.

CCS Concepts 

• Computing methodologies → Latent Dirichlet allocation; Latent variable models; • Applied computing → Economics;


network, graphical model, Bayesian nonparametric statistics

1. INTRODUCTION
1.1 General Introduction

The input-output table is a matrix that summarizes the interdependence of industries in an economy [20]. It is concerned with the activity of industries that buy goods produced by other industries for their own production. Each row in the table represents the distribution of a producer’s output to other industries and each column represents the composition of inputs required by a certain industry to produce its output. The table is one of the fundamental statistics that describe the state of a macroeconomy and is assembled by governments worldwide and international organizations [30] [15] [29]. It is used by academics, businesspeople and government officials to capture the circular flow of transactions in an economy.

The basic methodology for assembling the input-output table was developed in the late 1930s. Although various developments have been made, the basic methodology remains the same. In this paper, we provide a new methodology to solve the problem of summarizing the interdependence of industries in an economy. The approach is, in essence, a dimensionality reduction technique that uses a graphical model to capture the dependence among multi-source datasets (i.e., the interfirm buyer-seller network and short textual information that summarizes firms’ main business lines). Although the motivation for this paper might be unfamiliar to the community, it shows the strength of using familiar machine learning techniques to answer real-world questions. Furthermore, it creates an opportunity to explore new research challenges that concern economic networks.
1.2 Comparison to the traditional approach

The interdependence of industries is formed by firms’ trade relationships of buying goods from other firms as inputs to their own production. Thus, setting aside all obstacles to gathering data, the ideal dataset on which to base the analysis starts at the firm level. One example of this ideal dataset is summarized in Table 1. This dataset contains all the information concerning which firm bought what goods from which other firm to produce what goods, together with the transaction date, price and volume of the purchased goods in an economy. However, because of the limitations of current information gathering technology and privacy issues concerning firms’ business strategies, it is currently impossible to gather these ideal data. To overcome this issue, the traditional input-output table is based on surveys and an interpretation of other primary and secondary economic data to gather information concerning how much of what goods is bought to produce particular goods in an industry (Table 2). Together with the list of industries and goods supposed to be produced in an economy, several rounds of meetings are held by professionals and the coefficients in the input-output transaction table are determined.



Table 3: Basic data used in this paper. It summarizes the interfirm relations among firms.

Buyer    Seller
Firm A   Firm B
Firm A   Firm C
Firm B   Firm D


An important factor to notice in the traditional approach is the absence of information concerning firms’ trade relationships (columns one and four in Table 1, and Table 3). The main reason for this omission is the difficulty of gathering such data. However, there are now information providers that gather this type of data in various ways (e.g., questionnaires, press releases, customs). Moreover, information concerning firms, such as text describing firms’ main business lines, web pages and detailed industry classifications, is also becoming increasingly available. Given these emerging changes in the data environment, a new way to approach the problem is needed.
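To make the relationship between the three tables concrete, the following minimal Python sketch derives the survey-style totals of Table 2 and the buyer-seller edge list of Table 3 from the ideal records of Table 1. The tuple layout and the function names `transaction_totals` and `edge_list` are illustrative assumptions and are not part of the original methodology.

```python
from collections import defaultdict

# Rows of Table 1: (buyer, purpose, goods, seller, price, volume, date).
IDEAL_ROWS = [
    ("Firm A", "To produce car", "tyre", "Firm B", 50, 2, "2015.9.1"),
    ("Firm A", "To produce car", "glass", "Firm B", 40, 5, "2015.9.1"),
    ("Firm A", "To produce car", "aluminum", "Firm C", 60, 3, "2015.9.1"),
    ("Firm B", "To produce tyre", "rubber", "Firm D", 2, 20, "2015.9.1"),
]


def transaction_totals(rows):
    """Aggregate price * volume per (purpose, goods), as in Table 2."""
    totals = defaultdict(int)
    for _buyer, purpose, goods, _seller, price, volume, _date in rows:
        totals[(purpose, goods)] += price * volume
    return dict(totals)


def edge_list(rows):
    """Keep only distinct (buyer, seller) pairs, as in Table 3."""
    edges = []
    for buyer, _purpose, _goods, seller, _price, _volume, _date in rows:
        if (buyer, seller) not in edges:
            edges.append((buyer, seller))
    return edges
```

The first function discards who traded with whom, whereas the second discards what was traded and at what value, which is the sense in which the two views use complementary parts of the ideal dataset.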

1.3 Contributions 

In this paper, we first propose a model to estimate the interdependence of industries using interfirm buyer-seller network information and short textual information concerning firms’ main business lines. We then provide a quantitative comparison between the predictive performance of the model and previous related machine learning models using both synthetic and real datasets. The quantitative experimental results are followed by a qualitative result that demonstrates how the proposed model could summarize the interdependence of industries in an economy. We also provide a direct comparison of the estimated interindustry structure from our approach to the input-output table.

The proposed approach is distinct from the traditional approach in three ways. First, it uses complementary information not used in the traditional approach (compare Tables 2 and 3). This is not an argument regarding which method is better. Instead, this paper asserts the usefulness of exploiting parts of the ideal dataset not previously exploited, ultimately making it possible to leverage more information than the traditional method. Second, the unsupervised nature of the proposed approach makes it possible to estimate the interdependence of industries from the bottom up, thereby automatically estimating the industries involved in an economy. This is in contrast to the traditional approach, where all the industries supposed to be operating in an economy have to be predefined, which might be problematic when trying to estimate an unconventional industry structure. Third is extensibility. The strength of the graphical modeling approach presented in this paper originates in its modularity. Although we only use node textual information to improve the understanding of the network in this paper, it is easy to extend the model to incorporate additional information, such as that summarized in Table 1. Moreover, geographic information and multiplex relational information, in addition to interfirm buyer-seller information, could also be incorporated, which makes it possible to further exploit the various sources of information emerging from the changes in the data environment.


1.4 Related works concerning our modelling 


The proposed model used to estimate the interdependence of industries in this paper is an extension of the sparse block (SB) model of Parkkinen et al. [24] that jointly models network information and node textual information. The joint modeling of network and textual information (i.e., interfirm buyer-seller relationships and a short line of text summarizing each firm’s main business line) is an important step to effectively estimate the interdependence of industries. Many authors [28] [23] have considered the importance of using extra information to improve our understanding of the network. We propose two models to enable the SB approach to jointly model network and node textual information. Of these, one can be regarded as a direct counterpart of the relational topic model (RTM) [6] and is related to the link-LDA method [22]. These models combine latent Dirichlet allocation (LDA) [5] with the mixed membership stochastic block (MMSB) model of Airoldi et al. [1] and are widely used in the literature.

The advantage of using the SB model instead of its MMSB counterpart as the underlying generative process for network formation is its ability to exploit the SB structure of the network directly. This view is shared with previous work dealing with the SB model [3] [25] [11, 42] [26]. The assumption that most industry pairs have no interactions is also motivated by the specific dataset used in this paper. In the dataset, each firm is requested to name up to five buyers of its products and suppliers of the intermediate goods used in its own production. This scheme corresponds to the fixed rank nomination scheme in social network analysis (cf. a friendship network) [13]. Under this scheme, all minor relationships (see footnote 1) would be ignored in the network, which makes it possible to estimate only the major interdependence among industries. The SB model enables us to exploit this SB structure more efficiently.

Additionally, the generating process of the SB model only models existing links (i.e., the edge list) and ignores links that are not formed. Compared with the MMSB model, where both the existence and nonexistence of links are modeled, this saves many computations when the network structure is sparse. This makes it suitable for large sparse graphs, such as the interfirm networks modelled in this paper.

The simultaneous estimation of the number of industries and active interactions among them is implemented by employing a two-dimensional extension of the Chinese restaurant process [27]. Our motivation for extending the Chinese restaurant process to two dimensions originates from a need to model the SB structure among industries. New link patterns among industries could either be generated from (i) new emerging industries or (ii) new link patterns emerging from already existing industries. Both types of link formation (i.e., exogenous and combinatorial) are important to the innovation process of an interfirm buyer-seller network, and we use this as the prior process in our model.

There is one disadvantage of our extended model. The Chinese restaurant process exhibits an important invariance property called exchangeability [8], which makes inference via Markov chain Monte Carlo (MCMC) sampling straightforward. However, this is no longer the case in our two-dimensional extension. This is a well-known issue when

1 For instance, a firm manufacturing cars would not list a 
stationery store as one of its most important suppliers. 








Table 1: Ideal data describing firms’ trade relationships of buying goods from other firms as an input to their own production.

Buyer    Purpose           Goods      Seller   Price  Volume  Date
Firm A   To produce car    tyre       Firm B   50     2       2015.9.1
Firm A   To produce car    glass      Firm B   40     5       2015.9.1
Firm A   To produce car    aluminum   Firm C   60     3       2015.9.1
Firm B   To produce tyre   rubber     Firm D   2      20      2015.9.1


Table 2: Basic data used to assemble the input-output table

Purpose           Goods      Total transaction
To produce car    tyre       100
To produce car    glass      200
To produce car    aluminum   180
To produce tyre   rubber     40


taking a sequential formulation (predictive distribution) approach to model the prior process [27] [31] [2]. One strategy is to follow previous works [19] [2] and directly process nonexchangeable priors by developing an appropriate inference methodology. However, we show that the break in the invariance property is only slight, and a minor modification to the joint distribution suffices to recover the invariance property. In the proposed model, we use the joint distribution with this approximation. After recovering exchangeability via approximation, inference is performed using collapsed Gibbs sampling. This is in line with previous Bayesian nonparametric models [9] [16].

1.5 Organization of the paper 

The remainder of this paper is organized as follows: In Section 2, we introduce the two basic models and illustrate their inference strategy. In Section 3, we present the two-dimensional Chinese restaurant process and demonstrate how the invariance property breaks down and is repaired in the joint distribution. In Section 4, we combine the joint distribution, which is motivated by the two-dimensional Chinese restaurant process, with one of the two basic models described in Section 2. In Section 5, we evaluate the performance of the proposed methods. In the final section, we discuss further related work and present the conclusion.


2. BASIC MODELS 

2.1 Sparse block model with text 


To jointly model network and node textual information, we use a combination of the SB model [24] and LDA [5]. Figure 1a shows the plate diagram of the model. The lower component corresponds to the SB model and the upper component represents LDA. The generative process comprises the following two stages.

1. Generate edge list:

(1) Sample $\theta \sim \mathrm{Dirichlet}(\alpha)$, where $\theta$ denotes the multinomial distribution over industry pair labels. The dimension of this multinomial is $K^2$.

(2) Sample each $\phi_k \sim \mathrm{Dirichlet}(\beta)$, where each $\phi_k$ denotes the multinomial distribution over firms. There are $M$ firms in the network, and from $\phi_k$, we can sample a firm from industry number $k$.

(3) For each edge, first sample an industry pair $(z_1, z_2) \sim \mathrm{Multinomial}(\theta)$ and then sample firms from each industry via $i \sim \mathrm{Multinomial}(\phi_{z_1})$ and $j \sim \mathrm{Multinomial}(\phi_{z_2})$. This completes the generation of an edge list. There are $N_l$ edges in total.

2. Generate word list:

(1) Sample each $\psi_k \sim \mathrm{Dirichlet}(\gamma)$, where each $\psi_k$ denotes the multinomial topic distribution over words. There are $W$ words in the vocabulary list.

(2) For each firm $i$ in the network, consider the distribution of an industry number involving $i$ as either a sender (i.e., $z_1$) or receiver (i.e., $z_2$). The distribution of industry numbers for a given firm $i$ in the edge list is denoted by $x_i$. Note that this could be approximated as

$$
x_{ik} \approx \phi_{ki}\,\frac{M\beta + \sum_{j=1}^{M}\left(q_{z_1=k,j} + q_{z_2=k,j}\right)}{K\beta + \sum_{l=1}^{K}\left(q_{z_1=l,i} + q_{z_2=l,i}\right)},
$$

where $q_{z_1=k,j}$ counts the number of edges in which node $j$’s industry number as a sender is $k$ and $q_{z_2=k,j}$ counts the number of edges in which node $j$’s industry number as a receiver is $k$. The approximation holds because $\phi_{ki}$ and $x_{ik}$ share the same numerator. To elaborate on this point, $\phi_{ki}$ could be calculated as the number of times $i$ is labeled industry number $k$ as either a sender or receiver divided by the number of times industry number $k$ appeared in the edge list, whereas $x_{ik}$ could be calculated as the number of times $i$ is labeled industry number $k$ as either a sender or receiver divided by the number of times $i$ appeared in the edge list. Hence, the only difference between the two is the denominator.

(3) There are $T_i$ words in firm $i$’s short text. For each word position $t$ in firm $i$’s text, sample $r_t \sim \mathrm{Multinomial}(x_i)$.

(4) Finally, sample each word from the topic distribution $\psi_{r_t}$. There are $\sum_i T_i$ words in total in the entire corpus.

Note that creating an edge directly from $\phi_k$ to $x_i$ risks the separation of the multinomials into those that explain the edge list and those that explain the word list. This separation was also noted by Chang and Blei [6] as a criticism of the link-LDA model [22]. Following previous work [6], we force parameter sharing by linking the sampled $z$ directly to $x_i$.
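The two-stage generative process above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `generate_sbt`, the fixed number of words per firm and the use of symmetric Dirichlet priors are hypothetical choices, and $x_i$ is formed from the sampled industry labels with $\beta$ smoothing so that the labels $z$ feed directly into the word stage.

```python
import numpy as np


def generate_sbt(K, M, W, n_edges, words_per_firm, alpha, beta, gamma, seed=0):
    """Sample an edge list and a word list from the SB model with text."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(K * K, alpha))    # over industry pairs
    phi = rng.dirichlet(np.full(M, beta), size=K)   # K x M, firms per industry
    psi = rng.dirichlet(np.full(W, gamma), size=K)  # K x W, words per industry

    edges = []
    counts = np.zeros((M, K))  # times firm i was labeled k as sender or receiver
    for _ in range(n_edges):
        pair = rng.choice(K * K, p=theta)
        z1, z2 = divmod(int(pair), K)               # sender and receiver industry
        i = int(rng.choice(M, p=phi[z1]))
        j = int(rng.choice(M, p=phi[z2]))
        edges.append((i, j, z1, z2))
        counts[i, z1] += 1
        counts[j, z2] += 1

    # x_i: smoothed distribution of industry labels attached to firm i.
    x = (counts + beta) / (counts + beta).sum(axis=1, keepdims=True)

    words = []
    for i in range(M):
        for _ in range(words_per_firm):
            r = int(rng.choice(K, p=x[i]))          # industry of this word position
            w = int(rng.choice(W, p=psi[r]))        # word drawn from that industry
            words.append((i, r, w))
    return edges, words
```

Because an edge is drawn as two independent firm draws, the sketch can produce repeated edges and self-loops, which is harmless for sparse networks.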

There are two previous works that are extremely close to our model. One is the block-LDA model of Balasubramanyan and Cohen [3], which also combines the SB model with LDA. However, for block-LDA, the focus is to model links between entities in the documents instead of the links between documents. In the application we consider (i.e., interfirm buyer-seller networks), node textual information is provided for each firm. Thus, we avoid using the block-LDA model. Another work that is similar to our model concerns topological feature classification [25]. The difference between this model and the proposed model is that each node has a class label instead of textual information. For instance, if we used industrial classification rather than textual information, the approach of Peel [25] might have been more appropriate. However, because we do not wish to set the number of roles as the number of prescribed industry classifications, and because we use more information than simply the industry classification by exploiting textual information about each firm’s business, we also avoid using the topological feature-based classification. The likelihood of the model is


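$$
\begin{aligned}
p(L, Z, W, R, \psi, \phi, \theta \mid \alpha, \beta, \gamma) = {} & D \prod_{z=1}^{K^2} \theta_z^{\,n_z + \alpha - 1} \prod_{k=1}^{K}\prod_{i=1}^{M} \phi_{ki}^{\,q_{z_1=k,i} + q_{z_2=k,i} + \beta - 1} \prod_{k=1}^{K}\prod_{w=1}^{W} \psi_{kw}^{\,\sum_{i=1}^{M} r_{i,t=k,w} + \gamma - 1} \\
& \times \prod_{i=1}^{M}\prod_{k=1}^{K} \left( \phi_{ki}\,\frac{M\beta + \sum_{j=1}^{M}\left(q_{z_1=k,j} + q_{z_2=k,j}\right)}{K\beta + \sum_{l=1}^{K}\left(q_{z_1=l,i} + q_{z_2=l,i}\right)} \right)^{r_{ik}},
\end{aligned}
\tag{1}
$$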


where $M$ denotes the number of nodes, $K$ denotes the number of industries (i.e., topics), $T_i$ denotes the number of words in the textual information for node $i$, $W$ denotes the number of unique words in the node textual information, $D$ is a normalizing constant, $n_z$ denotes the number of times block pair $z$ has been sampled, $q_{z_1=k,i}$ counts the number of edges in which node $i$’s industry number as a sender is $k$, $q_{z_2=k,i}$ counts the number of edges in which node $i$’s industry number as a receiver is $k$, $r_{i,t=k,w}$ denotes the number of word positions of node $i$ that have word $w$ and topic number $k$, $r_{ik} := \sum_{w=1}^{W} r_{i,t=k,w}$, and the remaining quantities denote multinomial distributions and hyperparameters. Note that the last term is an approximation that uses the fact that $\phi_{ki} = f(k,i) / \sum_{i} f(k,i)$ and $x_{ik} = f(k,i) / \sum_{k} f(k,i)$ share the same numerator, as already noted above.

The collapsed Gibbs sampler for each edge (i.e., $p(z_0 \mid \cdot)$, where $z_0$ denotes the industry pair of the link we are sampling) and word position (i.e., $p(r_0 \mid \cdot)$, where $r_0$ denotes the industry of the word position we are sampling) could be derived by taking exactly the same steps as those for LDA [10]. We omit the derivation to save space.
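Although the derivation is omitted, the general shape of such a collapsed update can be illustrated for the edge part alone. The sketch below is an assumption-laden illustration, not the sampler of the full model: it covers only the plain SB edge model with symmetric Dirichlet priors, it ignores the coupling term between edge labels and word positions, and the function name `sb_edge_conditional` and the array layout are hypothetical. All counts are assumed to exclude the edge currently being resampled.

```python
import numpy as np


def sb_edge_conditional(i, j, n, q, alpha, beta):
    """Conditional distribution over industry pairs for one edge (i, j).

    n: K x K array, n[k1, k2] = edges currently assigned to pair (k1, k2).
    q: K x M array, q[k, v] = times node v carries industry k (either role).
    Both arrays must exclude the edge being resampled.
    """
    K, M = q.shape
    q_tot = q.sum(axis=1)
    send = (q[:, i] + beta) / (q_tot + M * beta)   # node i as sender in k1
    recv = (q[:, j] + beta) / (q_tot + M * beta)   # node j as receiver in k2
    prob = (n + alpha) * np.outer(send, recv)

    # When k1 == k2 both endpoints are drawn from the same industry, so the
    # second draw sees the count added by the first one.
    same_node = 1.0 if i == j else 0.0
    recv_same = (q[:, j] + beta + same_node) / (q_tot + M * beta + 1.0)
    idx = np.arange(K)
    prob[idx, idx] = (n[idx, idx] + alpha) * send * recv_same
    return prob / prob.sum()
```

A Gibbs sweep under these assumptions would remove an edge from `n` and `q`, draw a new pair from the returned matrix, and add the edge back under the new labels.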


2.2 Reversed sparse block model with node textual information

The generating process described above can be reversed. First, the word list is generated and then the inferred multinomial distributions are used to generate the edge list. Figure 1b shows the plate diagram of this model. The generative process is as follows:

1. Generate word list:

(1) Sample each $\psi_k \sim \mathrm{Dirichlet}(\gamma)$, where each $\psi_k$ denotes the multinomial topic distribution over words. There are $W$ words in the vocabulary list.

(2) For each firm $i$ in the network, sample its topic proportion as $\phi_i \sim \mathrm{Dirichlet}(\gamma)$.

(3) There are $T_i$ words in firm $i$’s short text. For each word position $t$ in firm $i$’s text, sample each topic as $z_t \sim \mathrm{Multinomial}(\phi_i)$.

(4) Sample each word $w_t$ from topic distribution $\psi_{z_t}$. There are $\sum_i T_i$ words in total in the entire corpus.

2. Generate edge list:

(1) Sample $\theta \sim \mathrm{Dirichlet}(\alpha)$, where $\theta$ denotes the multinomial distribution over industry pair labels. The dimension of this multinomial is $K^2$.

(2) For all firms in the network, consider the distribution of the topic number (industry number) for words in each firm’s short text. The distribution of firms for a given industry $k$ is denoted by $x_k$. This could be approximated as $x_{ki} \approx \phi_{ik}$ because, as before, $\phi_{ik}$ and $x_{ki}$ share the same numerator. There are $M$ firms in the network.

(3) For each edge, first sample the industry pair $(y_1, y_2) \sim \mathrm{Multinomial}(\theta)$ and then sample firms from each industry using $i \sim \mathrm{Multinomial}(x_{y_1})$ and $j \sim \mathrm{Multinomial}(x_{y_2})$. This completes the generation of the edge list.

Although this model is essentially the same as that described above, we also measure the predictive performance of this model because it closely resembles the RTM, in which we start with LDA and generate the network structure using MMSB [6]. The sampler of this model could be derived in a similar manner.


3. TWO-DIMENSIONAL CHINESE RESTAURANT PROCESS

To enable the aforementioned model to process a potentially infinite number of industries and the interdependence among them, we introduce a new prior distribution by extending the Chinese restaurant process. A schematic figure describing the process is shown in Figure 2. As with other Bayesian nonparametric models, we derive the distribution by defining a sequential process. There are other ways to define exactly the same joint distribution as that of Equation (7). However, the following illustration, which adheres to a sequential formulation, makes it clear how the process distinguishes between the creation of new industry pairs involving new industries (i.e., Figure 2d) and the creation of new industry pairs using a combination of industries that already exist (i.e., Figure 2c). Both types of innovation are well known to exist and thus, we derive the distribution from the sequential formulation.

At the start of the enterprise, suppose that a pair of firms both in industry A establishes a trade relationship. The second pair of firms to arrive in the economy first tries to create a new industry B, succeeding with a probability governed by the parameter $\eta$. If the pair succeeds in creating the new industry, the pair could both belong to industry B or the link could be classified as a combination involving A (e.g., AB or BA). The ordering of pairs is important. If AB is chosen, this implies that the firm in industry A sells goods to the firm in industry B, and vice versa. If the second pair fails to create a relationship involving a new industry, this pair could just follow the first pair and connect firms within industry A. Suppose now that the second pair succeeds in creating a new industry and links firms between industries AB. Under this scenario, the third pair to arrive in the economy now has a third option. This pair could either create a new link involving a new industry C or create a new link among industry pairs that nobody has created before (e.g., BB or BA), with some probability governed by the parameter $\alpha$, or this pair could follow the first and second pairs and create links among industries according to the industry pairs’ popularity. These three distinct behaviors are summarized in Figure 2.

To illustrate the process in more detail with an example, consider the following process.

1. A trade relationship is established between two firms both in industry A.

2. The second pair of firms succeeds in creating a new industry and links are established among industries AB (i.e., firms in A sell to firms in B).

3. The third pair of firms decides to follow the first pair and forms a new link among firms in industry A.

4. The fourth pair of firms decides to follow the first pair and forms a new link among firms in industry A.

5. The fifth pair of firms determines that one could create a new link among firms both in industry B and decides to create a new link.

6. The sixth pair of firms succeeds in creating a new link involving a new industry C that connects firms among industries CA (i.e., firms in C sell to firms in A).

7. The seventh pair of firms decides to follow the second pair and forms a new link among firms among industries AB (i.e., firms in A sell to firms in B).

8. The eighth pair of firms decides to follow the first pair and forms a new link among firms in industry A.

From the illustration above, the joint distribution of the above process can be written as

$$
\begin{aligned}
p(z_{1:8} \mid \alpha, \eta) = {} & \frac{\eta}{1+\eta}\,\frac{1}{3}\,\frac{2}{2+\eta}\,\frac{\alpha+1}{4\alpha+2}\,\frac{3}{3+\eta}\,\frac{\alpha+2}{4\alpha+3} \\
& \times \frac{4}{4+\eta}\,\frac{\alpha}{4\alpha+4}\,\frac{\eta}{5+\eta}\,\frac{1}{5}\,\frac{6}{6+\eta}\,\frac{\alpha+1}{9\alpha+6}\,\frac{7}{7+\eta}\,\frac{\alpha+3}{9\alpha+7}.
\end{aligned}
\tag{2}
$$

Note that $\eta$ controls the probability of creating a new industry while $\alpha$ controls the probability of creating new edges (i.e., Figure 2). It is important to note that exchangeability is broken because the denominators of the 4th, 6th, 8th, 12th, ... terms, namely

$$
(4\alpha+2),\ (4\alpha+3),\ (4\alpha+4),\ (9\alpha+6),\ (9\alpha+7),
\tag{3}
$$

depend on when the new industry was created. To summarize, based on the above sequential formulation, the exact timing for the creation of a new industry influences the probability of industry pairs using already existing industries to emerge. Rearranging the above terms, the contribution of each component to the joint likelihood is given by

$$
p(z_{1:N_l} = z \mid \alpha, \eta) = \frac{(I_{z,1}-1)(I_{z,2}-1)\cdots(I_{z,N_z}-1)}{(I_{z,1}-1+\eta)\cdots(I_{z,N_z}-1+\eta)} \times \frac{\alpha\,(n_z-1+\alpha)\cdots(1+\alpha)}{(I_{z,1}-1+K^2_{\Lambda(z,1)}\alpha)(I_{z,2}-1+K^2_{\Lambda(z,2)}\alpha)\cdots(I_{z,N_z}-1+K^2_{\Lambda(z,N_z)}\alpha)}
\tag{4}
$$

for industry pairs that were first produced by combining existing industries and

$$
p(z_{1:N_l} = z \mid \alpha, \eta) = \frac{\eta\,(I_{z,2}-1)\cdots(I_{z,N_z}-1)}{(I_{z,1}-1+\eta)\cdots(I_{z,N_z}-1+\eta)} \times \frac{(n_z-1+\alpha)\cdots(1+\alpha)}{(I_{z,2}-1+K^2_{\Lambda(z,2)}\alpha)\cdots(I_{z,N_z}-1+K^2_{\Lambda(z,N_z)}\alpha)} \times \frac{1}{K_z^2-(K_z-1)^2}
\tag{5}
$$

for industry pairs that were first produced by adding a new industry to the economy, where $I_{z,i}$ denotes the identifier of the $i$th entrepreneur to create a link using industry pair $z$. Note that when $\eta$ and $\alpha$ are sufficiently close to 0, $K^2_{\Lambda(z,N_z)}\alpha$ increases slowly compared with $I_{z,N_z}-1$, which makes it possible to approximate

$$
\frac{(I_{z,2}-1)\cdots(I_{z,N_z}-1)}{(I_{z,2}-1+K^2_{\Lambda(z,2)}\alpha)\cdots(I_{z,N_z}-1+K^2_{\Lambda(z,N_z)}\alpha)}
\tag{6}
$$

simply as 1. With this approximation, we determine the joint distribution of the process for $N_l$ links, which can be written as

$$
p(z_{1:N_l} \mid \alpha, \eta) = D\,\frac{\Gamma(\eta)\prod_{z=1}^{K^2}\Gamma(n_z+\alpha)}{\Gamma(N_l+\eta)\,\Gamma(\alpha)^{K^2}} \left(\frac{\eta}{\alpha}\right)^{K} \prod_{k=1}^{K}\frac{1}{k^2-(k-1)^2},
\tag{7}
$$

where $D$ is a normalizing constant, $K$ denotes the number of industries (i.e., topics) and $n_z$ denotes the number of times block pair $z$ has been sampled.
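The sequential process described in this section can be simulated directly. The following Python sketch is an illustration under stated assumptions: the function name `sample_2d_crp` is hypothetical, industries are numbered from 0, and each pair is stored as (seller industry, buyer industry). It implements the exact sequential (non-exchangeable) process, with a new industry created with probability $\eta / (t - 1 + \eta)$ for the $t$th pair and existing or unused combinations otherwise chosen with weight $n_z + \alpha$. It does not implement the approximated joint distribution.

```python
import random


def sample_2d_crp(n_links, alpha, eta, seed=0):
    """Simulate industry pairs for n_links arriving pairs of firms."""
    rng = random.Random(seed)
    K = 1                      # the first pair links two firms in industry 0
    counts = {(0, 0): 1}
    links = [(0, 0)]
    for t in range(2, n_links + 1):
        prev = t - 1           # number of links already in the economy
        if rng.random() < eta / (prev + eta):
            # A new industry appears; pick uniformly among the
            # K^2 - (K - 1)^2 pairs that involve it.
            K += 1
            new = K - 1
            candidates = [(new, k) for k in range(K)]
            candidates += [(k, new) for k in range(K - 1)]
            z = rng.choice(candidates)
        else:
            # Existing pairs are chosen by popularity; unused combinations
            # of existing industries get weight alpha.
            pairs = [(a, b) for a in range(K) for b in range(K)]
            weights = [counts.get(p, 0) + alpha for p in pairs]
            z = rng.choices(pairs, weights=weights, k=1)[0]
        counts[z] = counts.get(z, 0) + 1
        links.append(z)
    return links, K
```

With small $\eta$ and $\alpha$, the simulated economies contain few industries and only a small fraction of the possible industry pairs, which is the sparse block behavior the prior is meant to encode.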


4. INFINITE SPARSE BLOCK WITH NODE TEXTUAL INFORMATION

We use the joint distribution derived in the previous section as the prior distribution of the proposed model. The sampler for the infinite SB with node textual information (InfSBT) model is as follows: For a particular edge, we sample from

$$
p(z_0 \mid \cdot) \propto
\begin{cases}
\dfrac{n_z + \alpha}{N_l - 1 + \eta} \\[2ex]
\dfrac{\alpha}{N_l - 1 + \eta} \\[2ex]
\dfrac{\eta}{N_l - 1 + \eta}\,\dfrac{1}{(K+1)^2 - K^2}
\end{cases}
\tag{8}
$$

where $z_0$ denotes the one pair of nodes which we are sampling. The terms in addition to the sampler of the industry pairs are exactly the same as those of the SBT model and hence are omitted here to save space. The first branch corresponds to the case where an existing combination is sampled according to its popularity, the second branch corresponds to the case where a new combination is sampled using already existing industries and the final branch corresponds to the case where a new combination is sampled using a new industry as either the sender or receiver. The sampler for a particular word-topic pair is the same as mentioned previously and thus is not reproduced here.
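The three branches of the sampler can be illustrated with a short Python sketch that computes only the prior factors. It is a hedged illustration: the function name `infsbt_prior_weights` and the array layout are assumptions, the counts are assumed to exclude the edge being resampled, and the likelihood terms shared with the SBT model are not included.

```python
import numpy as np


def infsbt_prior_weights(n, alpha, eta):
    """Unnormalized prior weights for one edge.

    n: K x K integer array of block pair counts without the current edge,
       so n.sum() equals N_l - 1.
    Returns (w, w_new, n_new_pairs): w[k1, k2] is the weight of each pair of
    existing industries, w_new is the weight of each single pair involving a
    new industry, and n_new_pairs is the number of such pairs.
    """
    K = n.shape[0]
    denom = n.sum() + eta
    # Branch 1 for pairs already in use, branch 2 for unused combinations.
    w = np.where(n > 0, (n + alpha) / denom, alpha / denom)
    # Branch 3: a new industry, spread over the (K + 1)^2 - K^2 new pairs.
    n_new_pairs = (K + 1) ** 2 - K ** 2
    w_new = eta / denom / n_new_pairs
    return w, w_new, n_new_pairs
```

In a full sampler these weights would be multiplied by the corresponding likelihood terms before normalizing and drawing the new industry pair.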


5. RESULTS 





Table 4: Evaluation of the estimated block structure.

Model    NetAE  NetVI  TopicAE  TopicVI
SB       268    2.91   NA       NA
SBT      140    1.53   92       1.12
RevSBT   246    1.92   94       1.24
InfSBT   138    1.64   92       1.16


5.1 Synthetic data 

To demonstrate the performance of our model, we first use a synthetic dataset. The network consists of 70 nodes and 249 edges. The edges are randomly sampled from one of the 20 active interactions among 16 industries. The edge list is accompanied by a word list. Each node is associated with 0-12 words from the topic distributions (industries) in which they are involved. The word list is almost the same length as the edge list and consists of 230 words in total. Our goal is to simultaneously estimate the number of industries, underlying block structure and topic distribution from the randomly permuted colorless version of the network.

Evaluation of the estimated block structure. We first examine how well the proposed model is able to determine the true block structure governing the synthetic data. For comparison, we also report the results given by the SB model [24]. For models other than the infinite version, we set the number of industries to 16, which is the true number of blocks used to generate the dataset. Although the hyperparameters could be estimated using a maximum a posteriori estimate or full Bayesian approach, we set them to 0.05 for ease of computation. For all the finite models, we perform 50,000 iterations, with the final realization used for evaluation. The performance is compared using two measures: variation of information (VI) [T§] and the absolute error (AE) between the true and estimated block structures. Every measure requires the ground truth network, and the latter measure requires the additional constraint that the number of industries is the same as the true number of blocks. We report these two measures for both the edge list and word list.
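For reference, the variation of information between two labelings of the same items can be computed as the sum of the two conditional entropies. The sketch below is a generic implementation of that definition and is not taken from the paper's code: the function name `variation_of_information` is hypothetical and the natural logarithm is an assumption, since the base used in the reported figures is not stated.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A | B) + H(B | A), in nats."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        vi -= p_ab * (math.log(p_ab / p_a) + math.log(p_ab / p_b))
    return vi
```

The measure is zero when the two labelings agree up to a renaming of the labels, which is why it does not require the estimated and true block indices to be aligned.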

Table 4 reports the results. “Net” represents the estimation performance for the underlying block structure and “Topic” represents that for topic distributions. AE represents the absolute error and VI represents the variation of information. It shows that the use of additional textual information enables the proposed models to significantly outperform the SB model. Moreover, there is little difference between the performance of the SB model with node textual information and its reverse counterpart for estimating the topic distributions. However, for estimating the block structure, the SB model with node textual information is clearly superior to its reverse version. Furthermore, the infinite version performs almost as well as its finite counterpart.

Predictive performance. Next, we compare the predictive performance of the model for both the edge list and word list. Similarly to other probabilistic models, InfSBT defines a probability distribution over the given data. However, compared with MMSB or RTM, which explicitly model the existence and nonexistence of a link between two nodes, the SB model and InfSBT only model the probability of a

Table 5: Average score for edge list prediction

Data       Null   MMSB  RTM    SB    InfSBT
Synthetic  1827   1773  1623   1111  794.6
Real       24572  8228  18328  7934  7189


Table 6: Average AUC score for link prediction

Data       MMSB   RTM    SB     InfSBT
Synthetic  0.557  0.584  0.692  0.804
Real       0.614  0.544  0.817  0.839


certain edge list occurring (see footnote 3). Therefore, simply dividing the edge list into training and test sets would create a test set that only consists of one label (i.e., existence of a link, because the data is an edge list), which makes it difficult to use traditional measures such as the area under the receiver operating characteristic curve (AUC), which requires both 0 (i.e., nonexistence of a link) and 1 (i.e., existence of a link) labels in the test set. Thus, to evaluate the predictive performance of these two model types, we first define a score function to compare the models without adjusting the test set data.

For all possible links given a set of nodes, we evaluate 
the probability of a link being connected (for MMSB-type 
models) or the event probability that a link is generated 
from all possible links (for SB-type models). We then rank 
each possible link in decreasing order of probability. The 
average rank in the test edge lists is used as the evaluation 
score. 
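The average rank score can be sketched as follows. This is an illustrative reading of the description above, not the evaluation code used for the reported results: the function name `average_rank` is hypothetical, every ordered node pair (including self-pairs) is treated as a possible link, and ties are broken by position, all of which are assumptions.

```python
import numpy as np


def average_rank(scores, test_edges):
    """Mean rank of the test edges when all pairs are sorted by score.

    scores: M x M array, scores[i, j] = model probability for link (i, j).
    test_edges: iterable of (i, j) index pairs held out for testing.
    Rank 1 is the most probable pair, so lower averages are better.
    """
    scores = np.asarray(scores, dtype=float)
    flat = scores.ravel()
    order = np.argsort(-flat, kind="stable")      # decreasing probability
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, flat.size + 1)
    n_cols = scores.shape[1]
    return float(np.mean([ranks[i * n_cols + j] for i, j in test_edges]))
```

Because the score depends only on the ordering of the probabilities, it can compare models that define link probabilities on different scales.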

The predictive performance on the unseen edge list is evaluated by dividing the edge list into 10 sets. For each set, we train the MMSB model, RTM, SB model and InfSBT model. The number of groups is set to 16 in all models except InfSBT. For MMSB and RTM, we use the code provided by the author of RTM [6, 7]. We also compare the performance with the null model in which the probability of each link is randomly ordered. The first row of Table 5 reports the average score from the 10 sets. It is apparent that, without additional sparsity constraints, the SB model outperforms the MMSB model and RTM.

To compare the performance of the models using a traditional
measure (i.e., AUC), we modify the test dataset of each training
and test set pair in the following way. To each test dataset, we
randomly add 500 node pairs without links. For this modified
dataset, we calculate the ROC curve and the AUC for each model
on each split of the dataset and take the mean AUC over the
splits. Table 6 reports the results. As before, we observe that
InfSBT outperforms the other methods.
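
The modified test set can be sketched as follows. This is an illustration under stated assumptions rather than the original evaluation code: the helper names `sample_non_links` and `auc` are hypothetical, self-loops are excluded by assumption, and the AUC is computed through its pairwise-comparison (Mann-Whitney) form instead of an explicit ROC curve.

```python
import numpy as np

def sample_non_links(n_nodes, edges, n_samples, rng):
    """Draw distinct node pairs (i, j), i != j, that are not observed edges."""
    existing = {(i, j) for i, j in edges if i != j}
    available = n_nodes * (n_nodes - 1) - len(existing)
    if n_samples > available:
        raise ValueError("not enough unlinked node pairs to sample from")
    negatives = set()
    while len(negatives) < n_samples:
        i, j = (int(v) for v in rng.integers(0, n_nodes, size=2))
        if i != j and (i, j) not in existing:
            negatives.add((i, j))
    return sorted(negatives)

def auc(pos_scores, neg_scores):
    """Probability that a linked pair scores above an unlinked pair (ties count 0.5)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())

rng = np.random.default_rng(0)
negatives = sample_non_links(n_nodes=5, edges=[(0, 1), (1, 2)], n_samples=3, rng=rng)
print(len(negatives))
print(auc([0.9, 0.8], [0.1, 0.85]))
```

In the setting of the text, `n_samples` would be 500, `pos_scores` would hold a model's scores for the held-out edges and `neg_scores` its scores for the sampled unlinked pairs; the mean over the splits gives one entry of the table.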

Word prediction is performed in the usual way. We first
randomly divide the word list into a training set (90%) and
a test set (10%), ensuring that each node has at least one
word in the training set. We compare the models on the
test set in terms of predictive log-likelihood.

3 Hence, in the sparse block-type models, the same edge 
could occur more than once. This feature might be prob¬ 
lematic when the network is dense, but this is not the case 
for the datasets used in this paper. 






















Table 7: Predictive log-likelihood of the test word list

Data       LDA      RTM      InfSBT
Synthetic  -81.17   -88.58   -47.61
Real firm  -638.41  -661.83  -603.64


For this task, we compare the proposed model to LDA and
RTM. The first row of Table 7 shows the results. It shows
that taking network information into account results in better
predictive performance in terms of the unseen words.
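
The word-list split used for this task (90% training, 10% test, with every node keeping at least one training word) can be sketched as below. The representation of the word list as (node, word) tuples, the function name `split_word_list`, and the strategy of reserving one random token per node before sampling the test set are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def split_word_list(word_list, test_fraction=0.1, seed=0):
    """Split (node, word) tokens so that every node keeps a training word."""
    rng = random.Random(seed)
    by_node = defaultdict(list)
    for idx, (node, _word) in enumerate(word_list):
        by_node[node].append(idx)
    # Reserve one randomly chosen token per node for the training set.
    reserved = {rng.choice(indices) for indices in by_node.values()}
    candidates = [i for i in range(len(word_list)) if i not in reserved]
    rng.shuffle(candidates)
    # The test set is capped by the number of unreserved tokens.
    n_test = min(len(candidates), round(test_fraction * len(word_list)))
    test_idx = set(candidates[:n_test])
    train = [tok for i, tok in enumerate(word_list) if i not in test_idx]
    test = [tok for i, tok in enumerate(word_list) if i in test_idx]
    return train, test

tokens = [(node, f"w{k}") for node in range(3) for k in range(10)]
train, test = split_word_list(tokens)
print(len(train), len(test))
```

Reserving tokens first means the test set can fall short of 10% when many nodes have only one or two words, which is a trade-off of this particular sketch.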

5.2 Interfirm buyer-seller network 

We also apply the InfSBT model to a real-world interfirm
buyer-seller network. The network data are obtained from a
data provider that collects interfirm buyer-seller information
about Japanese firms (see footnote 4). In the dataset, each firm
is requested to name up to five buyers of its products and
suppliers of the intermediate goods used in its own production.
This scheme corresponds to the fixed rank nomination scheme in
social network analysis (cf. a friendship network). The fixed
rank nomination scheme has both advantages and disadvantages.
The advantage is that we can focus on the major
relationships operating in an economy. The disadvantage
is that minor interdependence of industries is prone to be
omitted from the estimate.

We use a subset of these data from the accounting year
2012 and focus on medium-sized firms and their surroundings.
The resulting network includes 222 firms. For each
firm, we obtain node textual information, written in Japanese,
that describes its main business lines. We first parse this
information using a morphological analysis technique [17] and
select a word list containing only nouns. We also delete
several stop words.

To be more precise about the node textual information, 
it is a short text that describes the main business lines of a 
company. If this information is well organized with a harmo¬ 
nized code, group labels might be easier to determine, but 
this is rarely the case. Moreover, there are firms that con¬ 
duct unusual business that cannot be fully captured using 
only industrial classification (group labels). Furthermore, 
although we did not use it in this paper, there is additional 
textual information that provides an overview of the com¬ 
pany. Further research should include this. Thus, in the 
paper, we choose to model textual information rather than 
group labels. 

MCMC is run for 2,000,000 iterations and we use the final
realization for our network visualization. The color and position
of each industry are determined using polar coordinates.
Figure 3 reports the underlying block structure estimated
from our model and shows that our proposed method successfully
determines the underlying block structure. There
is a general construction business in the upper-right part
of the network that purchases goods from other businesses
(e.g., hardware, concrete, machinery, wood, glass, interior,
cargo services and pipes). Another input-output
relationship describes a wholesale business centered
on concrete, which is sold to joinery and interior businesses.

Compared with the input-output table, the estimated in¬ 
dustry structure is sparser. In the input-output table, there 

4 Tokyo Shoko Research Ltd. 


are basic industries, such as electricity, real estate, banking
and office supplies, that supply all other industries, which
makes the interdependence of industries rather dense [20].
The sparser estimate is expected because we use network data
with a fixed rank nomination scheme, where each firm only
nominates up to five major buyers and sellers. This might not
be a problem when one is interested in the major relationships
and omits minor relationships (e.g., firms buying stationery
goods from a stationery store). However, the traditional
input-output table focuses more on which goods
are bought to produce a particular good, which makes it possible
to take these types of minor relationships into account.
For our approach to be able to incorporate these minor
relationships, more data close to the ideal data in Table 1 are
required. Despite this limitation, the success of estimating a
meaningful structure with this approach creates the opportunity
for further extensions that use more elaborate models that
could exploit multi-source datasets. A further direct comparison
between our estimate and the input-output table is
provided in Section 5.3.

Predictive performance We also compare the predictive
performance for these data. The bottom rows of Tables
5 and 6 report the results for edge list prediction and show
that the proposed model outperforms all other models. One
reason for the poor performance of RTM may be that the im¬ 
plementation provided by J. Chang and M. J. Chang 0 only 
models block diagonal interactions (i.e., assortative commu¬ 
nities), whereas the networks studied in this paper display 
a more disassortative nature.

The word prediction performance was also evaluated for
this dataset. The bottom row of Table 7 reports the results,
which show that, as with the synthetic dataset, InfSBT
outperformed all the other models.


5.3 Comparison to the traditional input-output table

Compared to the traditional approach, the main feature
of the presented approach lies in being able to directly assign
each interfirm link an interindustry pair. Thus, it is possible
to look up the list of firms classified in an industry or the list of
links classified in an interindustry pair in a direct manner.
By design, the traditional approach ignores interfirm
relations and cannot perform this direct lookup, which makes
the proposed approach complementary.
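
The direct lookup described above can be illustrated with a small sketch. It assumes the estimated assignments are stored in two plain dictionaries, one mapping each firm to an industry and one mapping each (seller, buyer) link to an interindustry pair; this layout, the link direction, and the name `build_lookups` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def build_lookups(firm_industry, link_blocks):
    """Invert assignments into industry -> firms and industry pair -> links.

    firm_industry: {firm: industry}                                   (assumed layout)
    link_blocks:   {(seller, buyer): (seller_industry, buyer_industry)} (assumed layout)
    """
    firms_by_industry = defaultdict(list)
    for firm, industry in firm_industry.items():
        firms_by_industry[industry].append(firm)
    links_by_pair = defaultdict(list)
    for link, pair in link_blocks.items():
        links_by_pair[pair].append(link)
    return dict(firms_by_industry), dict(links_by_pair)

firms, links = build_lookups(
    {"A": 0, "B": 1, "C": 1},
    {("A", "B"): (0, 1), ("A", "C"): (0, 1), ("B", "C"): (1, 1)},
)
print(firms[1])        # firms classified in industry 1
print(links[(0, 1)])   # links classified in the interindustry pair (0, 1)
```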

Putting aside this microlevel aspect, how similar or different
are the interindustry structures estimated by the two
approaches? We present a further qualitative comparison to
answer this question.

Figure 4a shows the regional input-output table of the
Nemuro-Kushiro area, which is located in the eastern part of
Hokkaido, Japan. Only the top edges, sorted by the strength of
their relations, are shown to aid visual comparison. We see that
fishery, crop farming and construction are major industries in
this area, showing strong linkage among the industry pairs.
Figure 4b shows the output of the presented approach (see
footnote 5). Four things are worth mentioning. First, inspecting
the topic words, we can confirm that fishery, food and
construction are major industries in this area. It is worth
emphasizing that this result was achieved using a completely
different dataset (as summarized in Table 2 and Table 3) and
methodology compared to

5 We selected 3,514 firms located in this area from the same 
data set used throughout the paper. 










the traditional approach. Second, we see that the network
structure of the construction sector is more complicated than
Figure 4a suggests, with multiple groups classified as
construction. This is not surprising because, in Japan,
many small firms are involved in the construction
business and are connected in a complicated manner. By
exploiting the microlevel interfirm network data, we are able
to separate the construction sector into distinct groups,
providing more detailed insight into the grouping structure
of firms in the network. Third, we see that hotels and pensions
form one of the major service sectors in this area, which is
consistent with the fact that Hokkaido is one of the main
sightseeing destinations in Japan. Finally, because the
presented approach uses only interfirm buyer-seller relations
as its main network data, it fails to take into account the
public and finance sectors. These two sectors are present in
Figure 4a, showing that the two approaches provide
complementary insights.


6. CONCLUSIONS 

Motivated by the practical problem of estimating the in¬ 
terdependence of industries in an economy, this paper first 
introduced InfSBT, a Bayesian nonparametric model of net¬ 
work formation that can (i) jointly model sparse network 
information and node textual information and (ii) jointly 
estimate the underlying latent block structure and number 
of components required to sufficiently represent the topology
of a network. The model extends a previous SB model
to jointly model node textual information
and an unbounded number of industries and interactions
among them. The second aspect of the model was
determined by defining a prior distribution that can process 
infinite mixtures in the network model. For this task, we 
introduced the two-dimensional Chinese restaurant process, 
which builds on its famous one-dimensional counterpart. We 
showed that, with sufficient approximation, the joint distri¬ 
bution derived from this process could be successfully used 
to define a model with an unbounded number of compo¬ 
nents. We tested the model using both synthetic and real 
datasets. 

By using the strength of dimensionality reduction, we
determined the underlying interdependence of industries in a
real-world network and outperformed previous models in
predictive tasks. There are other types of model that also
use a dimensionality reduction approach to link formation
[14]. However, this type of latent space model cannot provide
an interpretable summary of the underlying block structure
and instead displays each node in an abstract space.
The proposed model provides a concise summary of the
underlying block structure, as shown in Figure 3. Other works
that jointly model the network link structure and node
attributes focus on learning the node labels and are essentially
not dimensionality reduction techniques [21, 4]. Hence, they
cannot provide interpretable clusters and cannot respond to
the underlying motivation of this paper.

Figure 1: Plate diagram. (c) Infinite SBT

Figure 2: Schematic figure describing the 2DCRP. (a) describes the state of the interindustry network at time t - 1. (b-d)
describe the three types of cases that might follow in the next time step.

Figure 3: Network block structure estimated using InfSBT. Node size is adjusted by their number of links.
































































(a) Network plot showing the traditional input-output table of the
Nemuro-Kushiro area. For ease of comparison, only strongly
connected edges are depicted. The size of the link reflects its strength.

(b) Network plot showing the estimated interindustry structure
of the Nemuro-Kushiro area using InfSBT.

Figure 4: Comparison of our proposed approach to the traditional input-output table. Node size is adjusted by their number
of links.










